Normal Distribution (norm)#
The normal (Gaussian) distribution is the canonical model for additive noise and aggregated effects.
It appears throughout statistics and machine learning via the Central Limit Theorem, as the distribution of measurement errors, and as the maximum-entropy distribution under mean/variance constraints.
Notebook roadmap#
Title & classification
Intuition & motivation
Formal definition (PDF/CDF)
Moments & properties
Parameter interpretation
Derivations (\(\mathbb{E}[X]\), \(\mathrm{Var}(X)\), likelihood)
Sampling & simulation (NumPy-only)
Visualization (PDF, CDF, Monte Carlo)
SciPy integration (scipy.stats.norm)
Statistical use cases
Pitfalls
Summary
import math
import numpy as np
import scipy
from scipy import special, stats
import plotly
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
SEED = 7
rng = np.random.default_rng(SEED)
np.set_printoptions(precision=4, suppress=True)
print("numpy ", np.__version__)
print("scipy ", scipy.__version__)
print("plotly", plotly.__version__)
numpy 1.26.2
scipy 1.15.0
plotly 6.5.2
Prerequisites & notation#
Prerequisites
comfort with basic calculus (integration by parts)
basic probability (PDF/CDF, expectation, likelihood)
Notation
\(X \sim \mathcal{N}(\mu, \sigma^2)\) means: mean \(\mu\in\mathbb{R}\), standard deviation \(\sigma>0\).
\(Z \sim \mathcal{N}(0,1)\) denotes the standard normal.
\(\varphi\) and \(\Phi\) denote the standard normal PDF and CDF.
SciPy uses a location–scale parameterization: stats.norm(loc=μ, scale=σ).
1) Title & classification#
Name: norm (Normal / Gaussian distribution)
Type: continuous
Support: \(x \in (-\infty, \infty)\)
Parameter space:
location (mean): \(\mu \in \mathbb{R}\)
scale (std dev): \(\sigma \in (0, \infty)\)
Equivalent parameterizations you’ll also see:
variance \(\sigma^2 > 0\)
precision \(\tau = 1/\sigma^2 > 0\)
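Moving between these conventions is a one-liner. Below is a minimal sketch that converts a variance or a precision into SciPy's scale (σ); the helper names are our own, not part of SciPy.
# Hypothetical helpers (not SciPy API): convert variance / precision to scale σ.
def scale_from_variance(var: float) -> float:
    if var <= 0:
        raise ValueError("variance must be > 0")
    return math.sqrt(var)

def scale_from_precision(tau: float) -> float:
    if tau <= 0:
        raise ValueError("precision must be > 0")
    return 1.0 / math.sqrt(tau)

scale_from_variance(4.0), scale_from_precision(0.25)  # both correspond to σ = 2.0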
2) Intuition & motivation#
What it models#
The normal distribution often models the sum of many small, independent effects. A classic mental model is measurement error:
\(\text{observed} = \text{true signal} + \text{noise}\), where the noise is approximately Gaussian.
Two key reasons it shows up so often:
Central Limit Theorem (CLT): standardized sums of many weakly dependent variables tend toward a normal distribution (see the quick simulation after this list).
Maximum entropy: among all continuous distributions with a fixed mean and variance, the normal has the largest differential entropy (it is the “least informative” choice under those constraints).
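As a quick CLT illustration, the sketch below (using only the setup-cell imports) standardizes sums of iid Uniform(0,1) draws and compares empirical quantiles to standard-normal quantiles; agreement is already close with 30 terms.
# CLT sketch: standardized sums of iid Uniform(0,1) draws approach N(0,1).
# Uniform(0,1) has mean 1/2 and variance 1/12.
n_terms = 30
n_reps = 20_000
u = rng.random((n_reps, n_terms))
z_clt = (u.sum(axis=1) - n_terms * 0.5) / math.sqrt(n_terms / 12.0)
qs = [0.025, 0.25, 0.5, 0.75, 0.975]
np.quantile(z_clt, qs), stats.norm.ppf(qs)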
Typical real-world use cases#
Sensors & experiments: additive noise in physical measurements
Averages/aggregates: sampling distributions of means (often approximately normal)
Error models: regression residuals, Kalman filters, Gaussian processes
Latent-variable models: Gaussian priors and Gaussian likelihoods (conjugacy)
Relations to other distributions#
Standardization: if \(X \sim \mathcal{N}(\mu,\sigma^2)\), then \((X-\mu)/\sigma \sim \mathcal{N}(0,1)\).
Chi-square: if \(Z \sim \mathcal{N}(0,1)\), then \(Z^2 \sim \chi^2_1\).
Additivity: sums of independent normals are normal (means/variances add).
Student-\(t\): if \(Z \sim \mathcal{N}(0,1)\) and \(V \sim \chi^2_\nu\) are independent, then \(Z/\sqrt{V/\nu} \sim t_\nu\).
Lognormal: if \(Y \sim \mathcal{N}(\mu,\sigma^2)\), then \(\exp(Y)\) is lognormal.
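Each of these relations can be sanity-checked by simulation. A minimal sketch using Kolmogorov–Smirnov tests (large p-values are consistent with the claimed laws):
# Sanity-check two relations via KS tests against the claimed distributions.
xs_rel = stats.norm(loc=2.0, scale=3.0).rvs(size=20_000, random_state=rng)
z_std = (xs_rel - 2.0) / 3.0   # standardization -> N(0, 1)
z_rel = stats.norm.rvs(size=20_000, random_state=rng)
z_sq = z_rel**2                # square of a standard normal -> chi^2_1
stats.kstest(z_std, "norm").pvalue, stats.kstest(z_sq, "chi2", args=(1,)).pvalue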
3) Formal definition#
Let \(X \sim \mathcal{N}(\mu, \sigma^2)\) with \(\mu\in\mathbb{R}\) and \(\sigma>0\).
PDF#
\[ f(x\mid\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),\qquad x\in\mathbb{R}. \]
For the standard normal \(Z\sim\mathcal{N}(0,1)\), the PDF is \[ \varphi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}. \]
CDF#
The CDF is \[ F(x\mid\mu,\sigma) = \mathbb{P}(X\le x) = \Phi\!\left(\frac{x-\mu}{\sigma}\right), \] where \(\Phi\) is the standard normal CDF.
There is no elementary closed form, but it can be written using the error function: \[ \Phi(z) = \tfrac{1}{2}\left(1 + \operatorname{erf}\!\left(\tfrac{z}{\sqrt{2}}\right)\right). \]
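The erf identity is easy to verify numerically; this sketch compares it against scipy.special.ndtr (SciPy's standard normal CDF):
# Verify Phi(z) = (1/2) * (1 + erf(z / sqrt(2))) against special.ndtr.
z_chk = np.linspace(-6, 6, 13)
phi_via_erf = 0.5 * (1.0 + special.erf(z_chk / math.sqrt(2.0)))
float(np.max(np.abs(phi_via_erf - special.ndtr(z_chk))))  # ~ machine epsilon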
4) Moments & properties#
For \(X \sim \mathcal{N}(\mu, \sigma^2)\):
Moments#
Mean: \(\mathbb{E}[X] = \mu\)
Variance: \(\mathrm{Var}(X) = \sigma^2\)
Skewness: \(0\) (symmetry)
Kurtosis: \(3\) (excess kurtosis \(0\))
Median / mode: \(\mu\)
MGF and characteristic function#
MGF (all real \(t\)): \[ M_X(t) = \mathbb{E}[e^{tX}] = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right). \]
Characteristic function: \[ \varphi_X(t) = \mathbb{E}[e^{itX}] = \exp\!\left(i\mu t - \tfrac{1}{2}\sigma^2 t^2\right). \]
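A quick Monte Carlo check of the MGF formula (a sketch; the match tightens as the sample size grows):
# Monte Carlo check: E[exp(t X)] vs exp(mu t + sigma^2 t^2 / 2).
mu_m, sigma_m, t_m = 0.3, 0.9, 0.5
x_m = stats.norm(loc=mu_m, scale=sigma_m).rvs(size=200_000, random_state=rng)
mgf_mc = float(np.mean(np.exp(t_m * x_m)))
mgf_exact = math.exp(mu_m * t_m + 0.5 * sigma_m**2 * t_m**2)
mgf_mc, mgf_exact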
Entropy (differential, in nats)#
\[ H(X) = \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^2\right). \]
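SciPy exposes differential entropy directly, so the closed form is easy to confirm:
# Closed-form entropy vs stats.norm(...).entropy().
sigma_h = 1.7
h_formula = 0.5 * math.log(2.0 * math.pi * math.e * sigma_h**2)
h_scipy = float(stats.norm(loc=0.0, scale=sigma_h).entropy())
h_formula, h_scipy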
Other notable properties#
Affine invariance: if \(Y=aX+b\), then \(Y\) is normal with mean \(a\mu+b\) and variance \(a^2\sigma^2\).
Additivity: sums of independent normals are normal (and covariances add in the multivariate case).
Maximum entropy under fixed mean/variance constraints.
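A short simulation illustrating additivity (a sketch; the empirical moments should land near the theoretical values):
# Additivity sketch: for independent X1 ~ N(1, 2^2) and X2 ~ N(-0.5, 1.5^2),
# the sum X1 + X2 ~ N(0.5, 2^2 + 1.5^2) = N(0.5, 6.25).
x1 = stats.norm(1.0, 2.0).rvs(size=100_000, random_state=rng)
x2 = stats.norm(-0.5, 1.5).rvs(size=100_000, random_state=rng)
s = x1 + x2
float(s.mean()), float(s.var())  # expect ~0.5 and ~6.25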
SQRT_2PI = math.sqrt(2.0 * math.pi)
def norm_pdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
x = np.asarray(x, dtype=float)
if scale <= 0:
raise ValueError("scale must be > 0")
z = (x - loc) / scale
return np.exp(-0.5 * z**2) / (scale * SQRT_2PI)
def norm_cdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
x = np.asarray(x, dtype=float)
if scale <= 0:
raise ValueError("scale must be > 0")
z = (x - loc) / scale
return special.ndtr(z)
def norm_logpdf(x: np.ndarray, loc: float = 0.0, scale: float = 1.0) -> np.ndarray:
x = np.asarray(x, dtype=float)
if scale <= 0:
raise ValueError("scale must be > 0")
z = (x - loc) / scale
return -0.5 * z**2 - math.log(scale) - 0.5 * math.log(2.0 * math.pi)
def norm_loglik(loc: float, scale: float, x: np.ndarray) -> float:
x = np.asarray(x, dtype=float)
if scale <= 0 or np.any(~np.isfinite(x)):
return -np.inf
return float(np.sum(norm_logpdf(x, loc=loc, scale=scale)))
def norm_mle(x: np.ndarray) -> tuple[float, float]:
"""MLE for (μ, σ) under iid N(μ, σ²).
Note: the MLE for σ uses ddof=0 (biased as an estimator of σ).
"""
x = np.asarray(x, dtype=float)
mu_hat = float(np.mean(x))
sigma_hat = float(np.sqrt(np.mean((x - mu_hat) ** 2)))
return mu_hat, sigma_hat
def sample_norm_box_muller(
n: int,
loc: float = 0.0,
scale: float = 1.0,
rng: np.random.Generator | None = None,
) -> np.ndarray:
"""NumPy-only sampling via the Box–Muller transform.
Returns n iid samples from N(loc, scale^2).
"""
if rng is None:
rng = np.random.default_rng()
if n < 0:
raise ValueError("n must be >= 0")
if scale <= 0:
raise ValueError("scale must be > 0")
m = (n + 1) // 2 # number of (Z0, Z1) pairs
u1 = rng.random(m)
u2 = rng.random(m)
# Avoid log(0) when u1 is exactly 0.
u1 = np.maximum(u1, np.nextafter(0.0, 1.0))
r = np.sqrt(-2.0 * np.log(u1))
theta = 2.0 * math.pi * u2
z0 = r * np.cos(theta)
z1 = r * np.sin(theta)
z = np.empty(2 * m, dtype=float)
z[0::2] = z0
z[1::2] = z1
z = z[:n]
return loc + scale * z
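As a cross-check (a sketch), the NumPy-only helpers above should agree with scipy.stats.norm to floating-point accuracy:
# Cross-check the hand-rolled helpers against scipy.stats.norm.
x_chk = np.linspace(-5.0, 5.0, 11)
ref = stats.norm(loc=0.4, scale=1.9)
(
    float(np.max(np.abs(norm_pdf(x_chk, 0.4, 1.9) - ref.pdf(x_chk)))),
    float(np.max(np.abs(norm_cdf(x_chk, 0.4, 1.9) - ref.cdf(x_chk)))),
    float(np.max(np.abs(norm_logpdf(x_chk, 0.4, 1.9) - ref.logpdf(x_chk)))),
)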
5) Parameter interpretation#
Location \(\mu\)#
Shifts the distribution left/right.
\(\mu\) is the center of symmetry, and it equals the mean/median/mode.
Scale \(\sigma\)#
Controls dispersion: larger \(\sigma\) spreads mass out and lowers the peak.
About 68% / 95% / 99.7% of mass lies within \(\mu \pm 1\sigma\), \(\mu \pm 2\sigma\), \(\mu \pm 3\sigma\) (the “68–95–99.7 rule”).
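The 68–95–99.7 figures follow directly from the CDF; a one-cell check using the norm_cdf helper defined above:
# Empirical rule: P(|X - mu| <= k*sigma) = Phi(k) - Phi(-k) for k = 1, 2, 3.
k = np.array([1.0, 2.0, 3.0])
norm_cdf(k) - norm_cdf(-k)  # ~ [0.6827, 0.9545, 0.9973]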
Shape changes#
All normal PDFs are bell-shaped and symmetric; changing \(\mu\) shifts the bell, changing \(\sigma\) changes its width.
x = np.linspace(-8, 8, 800)
params = [
(0.0, 1.0),
(0.0, 2.0),
(1.5, 1.0),
(-2.0, 0.6),
]
fig = go.Figure()
for mu, sigma in params:
fig.add_trace(
go.Scatter(
x=x,
y=norm_pdf(x, loc=mu, scale=sigma),
mode="lines",
name=f"μ={mu:g}, σ={sigma:g}",
)
)
fig.add_vline(x=mu, line_dash="dot", opacity=0.25)
fig.update_layout(title="Normal PDFs for different (μ, σ)", xaxis_title="x", yaxis_title="f(x)")
fig.show()
6) Derivations#
We derive \(\mathbb{E}[X]\), \(\mathrm{Var}(X)\), and the likelihood/MLE.
Expectation#
For the standard normal \(Z\sim\mathcal{N}(0,1)\) with PDF \(\varphi(z)\), \[ \mathbb{E}[Z] = \int_{-\infty}^{\infty} z\,\varphi(z)\,dz. \] The integrand \(z\,\varphi(z)\) is an odd function (since \(\varphi\) is even), so the integral over a symmetric domain is \(0\).
For \(X = \mu + \sigma Z\): \[ \mathbb{E}[X] = \mu + \sigma\,\mathbb{E}[Z] = \mu. \]
Variance#
First compute \(\mathbb{E}[Z^2]\): \[ \mathbb{E}[Z^2] = \int_{-\infty}^{\infty} z^2\,\varphi(z)\,dz. \] Use the fact that \(\varphi'(z) = -z\,\varphi(z)\), so \(z\,\varphi(z) = -\varphi'(z)\). Then \[ \mathbb{E}[Z^2] = \int z^2\,\varphi(z)\,dz = -\int z\,\varphi'(z)\,dz. \] Integrate by parts with \(u=z\) and \(dv=\varphi'(z)\,dz\): \[ -\int z\,\varphi'(z)\,dz = -\big[z\,\varphi(z)\big]_{-\infty}^{\infty} + \int \varphi(z)\,dz. \] The boundary term is \(0\) because \(z\,\varphi(z)\to 0\) as \(|z|\to\infty\), and \(\int \varphi(z)\,dz = 1\). Hence \(\mathbb{E}[Z^2]=1\), and since \(\mathbb{E}[Z]=0\), \(\mathrm{Var}(Z) = \mathbb{E}[Z^2] - (\mathbb{E}[Z])^2 = 1\).
For \(X=\mu+\sigma Z\): \[ \mathrm{Var}(X) = \sigma^2\,\mathrm{Var}(Z) = \sigma^2. \]
Likelihood and MLE#
For iid data \(x_1,\dots,x_n\) from \(\mathcal{N}(\mu,\sigma^2)\), the likelihood is \[ L(\mu,\sigma) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right). \] The log-likelihood is \[ \ell(\mu,\sigma) = -n\ln\sigma - \tfrac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2. \] Setting derivatives to zero gives the MLEs: \[ \hat\mu = \bar x,\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2. \] (The familiar unbiased sample variance uses \(n-1\) instead of \(n\).)
# MLE demo on simulated data
true_mu = 1.5
true_sigma = 0.8
n = 600
x = sample_norm_box_muller(n, loc=true_mu, scale=true_sigma, rng=rng)
mu_hat, sigma_hat = norm_mle(x)
loglik_true = norm_loglik(true_mu, true_sigma, x)
loglik_hat = norm_loglik(mu_hat, sigma_hat, x)
true_mu, true_sigma, mu_hat, sigma_hat, loglik_true, loglik_hat
(1.5,
0.8,
1.4563135974860988,
0.7969433986282367,
-716.0835277507142,
-715.1801474751153)
7) Sampling & simulation (NumPy-only)#
Box–Muller transform#
Let \(U_1, U_2 \sim \mathrm{Uniform}(0,1)\) iid. Define \[ R = \sqrt{-2\ln U_1},\qquad \Theta = 2\pi U_2. \] Then \[ Z_0 = R\cos\Theta,\qquad Z_1 = R\sin\Theta \] are iid \(\mathcal{N}(0,1)\). Finally, to sample \(X\sim\mathcal{N}(\mu,\sigma^2)\), return \(X = \mu + \sigma Z\).
Numerical note: if \(U_1=0\), then \(\ln U_1\) is undefined, so we clip \(U_1\) away from 0.
# Sampling: compare histogram to the true PDF
mu = 0.7
sigma = 1.3
n = 60_000
samples = sample_norm_box_muller(n, loc=mu, scale=sigma, rng=rng)
x_grid = np.linspace(mu - 4.5 * sigma, mu + 4.5 * sigma, 500)
fig = px.histogram(
samples,
nbins=70,
histnorm="probability density",
title=f"Monte Carlo samples vs PDF (n={n}, μ={mu:g}, σ={sigma:g})",
labels={"value": "x"},
)
fig.add_trace(go.Scatter(x=x_grid, y=norm_pdf(x_grid, mu, sigma), mode="lines", name="true pdf"))
fig.update_layout(yaxis_title="density")
fig.show()
samples.mean(), samples.std(ddof=0)
(0.7031752075013465, 1.2937846790607508)
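Beyond eyeballing the histogram, a Kolmogorov–Smirnov test gives a quantitative check that the sampler matches the target distribution (a sketch; with a correct sampler the p-value should typically be well above 0.05):
# KS test: Box-Muller samples vs the target N(mu, sigma^2).
ks_stat, ks_pval = stats.kstest(samples, "norm", args=(mu, sigma))
float(ks_stat), float(ks_pval)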
8) Visualization (PDF, CDF, Monte Carlo)#
We’ll visualize:
the PDF for multiple \(\sigma\) values
the CDF and an empirical CDF from Monte Carlo samples
# PDF and CDF for multiple scales
mu = 0.0
sigmas = [0.5, 1.0, 2.0]
x = np.linspace(-8, 8, 800)
fig_pdf = go.Figure()
fig_cdf = go.Figure()
for s in sigmas:
fig_pdf.add_trace(go.Scatter(x=x, y=norm_pdf(x, mu, s), mode="lines", name=f"σ={s:g}"))
fig_cdf.add_trace(go.Scatter(x=x, y=norm_cdf(x, mu, s), mode="lines", name=f"σ={s:g}"))
fig_pdf.update_layout(title="Normal PDF (μ=0)", xaxis_title="x", yaxis_title="f(x)")
fig_cdf.update_layout(title="Normal CDF (μ=0)", xaxis_title="x", yaxis_title="F(x)")
fig_pdf.show()
fig_cdf.show()
# Empirical CDF vs true CDF
mu = -0.5
sigma = 1.2
n = 25_000
samples = sample_norm_box_muller(n, loc=mu, scale=sigma, rng=rng)
xs = np.sort(samples)
ys = np.arange(1, n + 1) / n
x_grid = np.linspace(mu - 4.5 * sigma, mu + 4.5 * sigma, 600)
fig = go.Figure()
fig.add_trace(go.Scatter(x=xs, y=ys, mode="lines", name="empirical CDF"))
fig.add_trace(go.Scatter(x=x_grid, y=norm_cdf(x_grid, mu, sigma), mode="lines", name="true CDF"))
fig.update_layout(
title=f"Empirical CDF vs true CDF (n={n}, μ={mu:g}, σ={sigma:g})",
xaxis_title="x",
yaxis_title="F(x)",
)
fig.show()
9) SciPy integration (scipy.stats.norm)#
SciPy’s norm is parameterized as stats.norm(loc=μ, scale=σ).
Useful methods include:
pdf, logpdf
cdf, sf (survival function), and the numerically stable logcdf, logsf
ppf (quantiles)
rvs (sampling)
fit (MLE fitting)
mu = 0.7
sigma = 1.3
dist = stats.norm(loc=mu, scale=sigma)
x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 7)
pdf_vals = dist.pdf(x)
cdf_vals = dist.cdf(x)
# Sampling
samples = dist.rvs(size=5, random_state=rng)
# Fit (MLE)
big_sample = dist.rvs(size=5_000, random_state=rng)
mu_fit, sigma_fit = stats.norm.fit(big_sample)
x, pdf_vals, cdf_vals, samples, (mu_fit, sigma_fit)
(array([-3.2, -1.9, -0.6, 0.7, 2. , 3.3, 4.6]),
array([0.0034, 0.0415, 0.1861, 0.3069, 0.1861, 0.0415, 0.0034]),
array([0.0013, 0.0228, 0.1587, 0.5 , 0.8413, 0.9772, 0.9987]),
array([-0.3951, 1.3869, 0.9551, 0.3398, -0.1837]),
(0.6852965688237751, 1.3020645349649365))
# Tail-stability: logcdf/logsf vs log(cdf/sf)
z = -40.0
cdf_direct = stats.norm.cdf(z)
logcdf_stable = stats.norm.logcdf(z)
z2 = 40.0
sf_direct = stats.norm.sf(z2)
logsf_stable = stats.norm.logsf(z2)
(cdf_direct, logcdf_stable), (sf_direct, logsf_stable)
((0.0, -804.6084420137539), (0.0, -804.6084420137539))
10) Statistical use cases#
Hypothesis testing (z-test for a mean, \(\sigma\) known)#
If \(X_1,\dots,X_n \sim \mathcal{N}(\mu,\sigma^2)\) with known \(\sigma\), then under \(H_0: \mu=\mu_0\), \[ Z = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1). \] A two-sided p-value is \(p = 2\,\mathbb{P}(|Z|\ge |z_{\mathrm{obs}}|)\).
Bayesian modeling (Normal–Normal conjugacy for a mean, \(\sigma\) known)#
Prior: \(\mu \sim \mathcal{N}(\mu_0,\tau_0^2)\). Likelihood: \(X_i\mid\mu \sim \mathcal{N}(\mu,\sigma^2)\) with known \(\sigma\).
Posterior: \(\mu\mid x \sim \mathcal{N}(\mu_n,\tau_n^2)\) where \[ \tau_n^2 = \left(\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}\right)^{-1},\qquad \mu_n = \tau_n^2\left(\tfrac{\mu_0}{\tau_0^2} + \tfrac{n\bar x}{\sigma^2}\right). \]
Generative modeling#
Normals are building blocks for generative models:
Linear Gaussian models (e.g., Kalman filters): Gaussian latent states + Gaussian noise
Gaussian mixtures (GMMs): weighted sums of normals for multi-modal densities
Multivariate normal: correlated features via linear transforms of independent normals
# Hypothesis test example: two-sided z-test for a mean (σ known)
mu0 = 0.0
sigma_known = 2.0
n = 40
# Simulated measurements with true mean != mu0
true_mu = 0.9
data = sample_norm_box_muller(n, loc=true_mu, scale=sigma_known, rng=rng)
xbar = data.mean()
z_obs = (xbar - mu0) / (sigma_known / math.sqrt(n))
p_two_sided = 2.0 * stats.norm.sf(abs(z_obs))
alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)
ci = (
xbar - z_crit * sigma_known / math.sqrt(n),
xbar + z_crit * sigma_known / math.sqrt(n),
)
xbar, z_obs, p_two_sided, ci
(1.1249403105684774,
3.5573736131335747,
0.0003745812231249413,
(0.5051452782639159, 1.744735342873039))
# Bayesian update for μ with known σ (Normal–Normal)
mu0 = 0.0
tau0 = 1.5 # prior std dev
sigma = sigma_known
xbar = data.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + n * xbar / sigma**2)
tau_n = math.sqrt(tau_n2)
mu_n, tau_n
(1.077070510118755, 0.309426373877638)
# Visualize prior vs posterior over μ
mu_grid = np.linspace(mu_n - 5 * tau0, mu_n + 5 * tau0, 600)
prior = stats.norm(loc=mu0, scale=tau0)
post = stats.norm(loc=mu_n, scale=tau_n)
fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_grid, y=prior.pdf(mu_grid), mode="lines", name="prior"))
fig.add_trace(go.Scatter(x=mu_grid, y=post.pdf(mu_grid), mode="lines", name="posterior"))
fig.update_layout(title="Bayesian update for μ (σ known)", xaxis_title="μ", yaxis_title="density")
fig.show()
# Generative modeling example: 2D correlated Gaussian via a linear transform
n = 3_000
mu_vec = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
L = np.linalg.cholesky(Sigma)
z = sample_norm_box_muller(2 * n, loc=0.0, scale=1.0, rng=rng).reshape(n, 2)
x = mu_vec + z @ L.T
df = {"x1": x[:, 0], "x2": x[:, 1]}
fig = px.scatter(df, x="x1", y="x2", opacity=0.35, title="Samples from a correlated 2D Gaussian")
fig.update_layout(xaxis_title="x1", yaxis_title="x2")
fig.show()
x.mean(axis=0), np.cov(x.T)
(array([ 0.9797, -1.0441]),
array([[1.0095, 0.8152],
[0.8152, 1.9817]]))
11) Pitfalls#
Invalid parameters: \(\sigma\le 0\) is not allowed. In code, guard against non-positive scale.
Overconfidence in normality: real data may be skewed, heavy-tailed, or multi-modal. Diagnose with histograms/QQ-plots; consider alternatives (e.g., Student-\(t\), mixtures, robust losses).
Outliers: Gaussian likelihoods heavily penalize large residuals, so a few outliers can dominate fits.
Numerical issues in the tails: cdf/sf may underflow to 0; prefer logcdf/logsf or work in log-space.
Sampling edge cases: Box–Muller requires \(U_1 > 0\); clip u1 away from 0 to avoid log(0).
12) Summary#
norm is a continuous distribution on \((-\infty,\infty)\) with parameters \(\mu\in\mathbb{R}\), \(\sigma>0\).
PDF: bell-shaped and symmetric; \(\mu\) shifts, \(\sigma\) spreads.
Key formulas: \(\mathbb{E}[X]=\mu\), \(\mathrm{Var}(X)=\sigma^2\), \(M_X(t)=\exp(\mu t + \tfrac12\sigma^2 t^2)\), \(H=\tfrac12\ln(2\pi e\sigma^2)\).
MLE: \(\hat\mu=\bar x\), \(\hat\sigma^2 = \tfrac1n\sum(x_i-\bar x)^2\).
For tails, prefer stats.norm.logcdf/logsf over taking log of cdf/sf.